Abstract


We propose a novel Bayesian nonparametric approach to learning with probabilistic
deterministic finite automata (PDFA). We define and develop a sampler for a
PDFA with an infinite number of states, which we call the probabilistic deterministic
infinite automaton (PDIA). Posterior predictive inference in this model, given
a finite training sequence, can be interpreted as averaging over multiple PDFAs of
varying structure, where each PDFA is biased towards having few states. We suggest
that our method for averaging over PDFAs is a novel approach to predictive
distribution smoothing. We test PDIA inference on PDFA structure learning
and on natural language and DNA data prediction tasks. The results suggest
that the PDIA presents an attractive compromise between the computational cost
of hidden Markov models and the storage requirements of hierarchically smoothed
Markov models.


1 Introduction 


The focus of this paper is a novel Bayesian framework for learning with probabilistic deterministic 
finite automata (PDFA) [9]. A PDFA is a generative model for sequential data (PDFAs are reviewed 
in Section 2). Intuitively a PDFA is similar to a hidden Markov model (HMM) [10] in that it 
consists of a set of states, each of which when visited emits a symbol according to an emission 
probability distribution. It differs from an HMM in how state-to-state transitions occur; transitions 
are deterministic in a PDFA and nondeterministic in an HMM. 


In our framework for learning with PDFAs we specify a prior over the parameters of a single large
PDFA that encourages state reuse. The inductive bias introduced by the PDFA prior provides a soft
constraint on the number of states used to generate the data. We take the limit as the number of states
becomes infinite, yielding a model we call the probabilistic deterministic infinite automaton (PDIA).


Given a finite training sequence, the PDIA posterior distribution is an infinite mixture of PDFAs. 
Samples from this distribution form a finite sample approximation to this infinite mixture, and can 
be drawn via Markov chain Monte Carlo (MCMC) [6]. Using such a mixture we can average over 
our uncertainty about the model parameters (including state cardinality) in a Bayesian way during 
prediction and other inference tasks. We find that averaging over a finite number of PDFAs trained 
on naturalistic data leads to better predictive performance than using a single “best” PDFA. 


We chose to investigate learning with PDFAs because they are intermediate in expressive power
between HMMs and finite-order Markov models, and thus strike a good balance between generalization
performance and computational efficiency. A single PDFA is known to have relatively limited
expressivity. We argue that a finite mixture of PDFAs has greater expressivity than that of a single
PDFA but is not as expressive as a probabilistic nondeterministic finite automaton (PNFA).¹ A PDIA
is clearly highly expressive; an infinite mixture over the same is even more so. Even though ours is
a Bayesian approach to PDIA learning, in practice we only ever deal with a finite approximation to
the full posterior and thus limit our discussion to finite mixtures of PDFAs.


¹ PNFAs with no final probability are equivalent to hidden Markov models [3].


While model expressivity is a concern, computational considerations often dominate model choice. 
We show that prediction in a trained mixture of PDFAs can have lower asymptotic cost than forward 
prediction in the PNFA/HMM class of models. We also present evidence that averaging over PDFAs 
gives predictive performance superior to HMMs trained with standard methods on naturalistic data. 
We find that PDIA predictive performance is competitive with that of fixed-order, smoothed Markov 
models with the same number of states. While sequence learning approaches such as the HMM 
and smoothed Markov models are well known and now highly optimized, our PDIA approach to 
learning is novel and is amenable to future improvement. 


Section 2 reviews PDFAs, Section 3 introduces Bayesian PDFA inference, Section 4 presents ex- 
perimental results on DNA and natural language, and Section 5 discusses related work on PDFA 
induction and the theoretical expressive power of mixtures of PDFAs. In Section 6 we discuss ways 
in which PDIA predictive performance might be improved in future research. 


2 Probabilistic Deterministic Finite Automata 


A PDFA is formally defined as a 5-tuple $M = (Q, \Sigma, \delta, \pi, q_0)$, where $Q$ is a finite set
of states, $\Sigma$ is a finite alphabet of observable symbols, $\delta : Q \times \Sigma \to Q$ is the
transition function from a state/symbol pair to the next state, $\pi : Q \times \Sigma \to [0, 1]$ is the
probability of the next symbol given a state, and $q_0$ is the initial state.² Throughout this paper we
will use $i$ to index elements of $Q$, $j$ to index elements of $\Sigma$, and $t$ to index elements of an
observed string. For example, $\delta_{ij}$ is shorthand for $\delta(q_i, \sigma_j)$, where $q_i \in Q$
and $\sigma_j \in \Sigma$.


Given a state $q_i$, the probability that the next symbol takes the value $\sigma_j$ is given by
$\pi(q_i, \sigma_j)$. We use the shorthand $\pi_{q_i}$ for the state-specific discrete distribution over
symbols for state $q_i$. We can also write $\sigma \mid q_i \sim \pi_{q_i}$, where $\sigma$ is a random
variable that takes values in $\Sigma$. Given a state $q_i$ and a symbol $\sigma_j$, however, the next
state $q_{i'}$ is deterministic: $q_{i'} = \delta(q_i, \sigma_j)$. Generating from a PDFA involves first
generating a symbol stochastically given the state the process is in: $x_t \mid \xi_t \sim \pi_{\xi_t}$,
where $\xi_t \in Q$ is the state at time $t$. Next, given $\xi_t$ and $x_t$, the process transitions
deterministically to the next state: $\xi_{t+1} = \delta(\xi_t, x_t)$. This is the reason for the
confusing “probabilistic deterministic” name for these models. Turning this around, given data, $q_0$,
and $\delta$, there is no uncertainty about the path through the states. This is a primary source of
computational savings relative to HMMs.
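The generative process and the deterministic state path described above can be sketched in a few lines of Python. This is a minimal illustration only: the two-state automaton, its alphabet, and all probabilities are hypothetical values chosen for the example, and the function names are not taken from the paper.

```python
import random

# Hypothetical two-state PDFA over the alphabet {"a", "b"}; all numbers are illustrative.
delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 0}   # transition function
pi = {0: {"a": 0.9, "b": 0.1}, 1: {"a": 0.2, "b": 0.8}}       # per-state emission distributions
q0 = 0                                                         # initial state


def generate(delta, pi, q0, length, rng):
    """Emit a symbol from the current state, then move deterministically."""
    state, symbols = q0, []
    for _ in range(length):
        alphabet = list(pi[state])
        weights = [pi[state][s] for s in alphabet]
        x = rng.choices(alphabet, weights=weights)[0]
        symbols.append(x)
        state = delta[(state, x)]
    return symbols


def state_path(delta, q0, symbols):
    """Recover the unique state sequence implied by the observed symbols."""
    path = [q0]
    for x in symbols:
        path.append(delta[(path[-1], x)])
    return path


def sequence_probability(delta, pi, q0, symbols):
    """Probability of the symbols: a product of emissions along the single path."""
    prob, state = 1.0, q0
    for x in symbols:
        prob *= pi[state][x]
        state = delta[(state, x)]
    return prob


rng = random.Random(0)
xs = generate(delta, pi, q0, 10, rng)
```

Because the path is unique, evaluating a sequence needs one table lookup per symbol rather than a sum over hidden paths.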


PDFAs are more general than $n$th-order Markov models (i.e. $m$-gram models, $m = n + 1$), but less
expressive than hidden Markov models (HMMs) [3]. For the case of $n$th-order Markov models, we
can construct a PDFA with one state per suffix $x_1 x_2 \ldots x_n$. Given a state and a symbol
$x_{n+1}$, the unique next state is the one corresponding to the suffix $x_2 \ldots x_{n+1}$. Thus
$n$th-order Markov models are a subclass of PDFAs with $O(|\Sigma|^n)$ states. For an HMM, given data
and an initial distribution over states, there is a posterior probability for every path through the
state space. PDFAs are those HMMs for which, given a unique start state, the posterior probability over
paths is degenerate at a single path. As we explain in Section 5, mixtures of PDFAs are strictly more
expressive than single PDFAs, but still less expressive than PNFAs.
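The suffix construction for $n$th-order Markov models can be made concrete with a short sketch. The function name and the DNA alphabet below are illustrative assumptions, and the sketch ignores how the first $n$ symbols of a sequence are handled (a full construction would also need states for shorter initial contexts).

```python
from itertools import product


def markov_as_pdfa(alphabet, n):
    """Build the states and transition function of the PDFA for an nth-order Markov model.

    Each state is a length-n context; reading symbol x drops the oldest symbol
    and appends x, so the next state is unique.
    """
    states = ["".join(p) for p in product(alphabet, repeat=n)]
    delta = {(s, x): (s + x)[1:] for s in states for x in alphabet}
    return states, delta


states, delta = markov_as_pdfa("ACGT", 2)
```

With a four-letter alphabet and $n = 2$ this yields $4^2 = 16$ states, matching the $O(|\Sigma|^n)$ count in the text.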


3 Bayesian PDFA Inference 


We start our description of Bayesian PDFA inference by defining a prior distribution over the pa- 
rameters of a finite PDFA. We then show how to analytically marginalize nuisance parameters out 
of the model and derive a Metropolis-Hastings sampler for posterior inference using the resulting 
collapsed representation. We discuss the limit of our model as the number of states in the PDFA goes 
to infinity. We call this limit the probabilistic deterministic infinite automaton (PDIA). We develop 
a PDIA sampler that carries over from the finite case in a natural way. 


3.1 A PDFA Prior 


² In general $q_0$ may be replaced by a distribution over initial states.


We assume that the set of states $Q$, set of symbols $\Sigma$, and initial state $q_0$ of a PDFA are
known but that the transition and emission functions are unknown. The PDFA prior then consists of a
prior over both the transition function $\delta$ and the emission probability function $\pi$. In the
finite case $\delta$ and $\pi$ are representable as finite matrices, with one column per element of
$\Sigma$ and one row per element of $Q$. For each column $j$ ($j$ co-indexes columns and set elements)
of the transition matrix $\delta$, our prior stipulates that the elements of that column are i.i.d.
draws from a discrete distribution $\phi_j$ over $Q$, that is, $\delta_{ij} \sim \phi_j$,
$0 \le i \le |Q| - 1$. The $\phi_j$ represent transition tendencies given a symbol: if the $i'$th
element of $\phi_j$ is large then state $q_{i'}$ is likely to be transitioned to anytime the last symbol
was $\sigma_j$. The $\phi_j$'s are themselves given a shared Dirichlet prior with parameters
$\alpha\mu$, where $\alpha$ is a concentration and $\mu$ is a template transition probability vector.
If the $i'$th element of $\mu$ is large then the $i'$th state is likely to be transitioned to regardless
of the emitted symbol. We place a uniform Dirichlet prior on $\mu$ itself, with $\gamma$ total mass,
and average over $\mu$ during inference. This hierarchical Dirichlet construction encourages both
general and context-specific state reuse. We also place a uniform Dirichlet prior over the per-state
emission probabilities $\pi_{q_i}$ with $\beta$ total mass, which smooths emission distribution
estimates. Formally:


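$$\mu \mid \gamma, |Q| \sim \mathrm{Dir}(\gamma/|Q|, \ldots, \gamma/|Q|) \qquad (1)$$

$$\phi_j \mid \alpha, \mu \sim \mathrm{Dir}(\alpha\mu) \qquad (2)$$

$$\pi_{q_i} \mid \beta, |\Sigma| \sim \mathrm{Dir}(\beta/|\Sigma|, \ldots, \beta/|\Sigma|)$$

$$\delta_{ij} \sim \phi_j$$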


where $0 \le i \le |Q| - 1$ and $1 \le j \le |\Sigma|$. Given a sample from this model we can run the
PDFA to generate a sequence of $T$ symbols. Using $\xi_t$ to denote the state of the PDFA at position
$t$ in the sequence:


$$\xi_0 = q_0, \quad x_0 \sim \pi_{\xi_0}, \quad \xi_t = \delta(\xi_{t-1}, x_{t-1}), \quad x_t \sim \pi_{\xi_t}$$
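To make the hierarchical prior concrete, the following sketch draws one finite PDFA from it using only the Python standard library. The function names, the zero-based symbol indexing, and the tiny floor on the Gamma shape parameter (a numerical guard, not part of the model) are assumptions made for this illustration.

```python
import random


def sample_dirichlet(params, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    # The floor only guards against a zero shape parameter caused by underflow.
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in params]
    total = sum(draws)
    return [g / total for g in draws]


def sample_pdfa_prior(num_states, num_symbols, alpha, beta, gamma, rng):
    """Sample (mu, phi, pi, delta) from the finite hierarchical Dirichlet prior."""
    # Template transition vector mu with total mass gamma.
    mu = sample_dirichlet([gamma / num_states] * num_states, rng)
    # One transition distribution phi_j per symbol, centered on mu.
    phi = [sample_dirichlet([alpha * m for m in mu], rng) for _ in range(num_symbols)]
    # One emission distribution per state with total mass beta.
    pi = [sample_dirichlet([beta / num_symbols] * num_symbols, rng) for _ in range(num_states)]
    # delta[i][j] is drawn i.i.d. from phi_j for every row i of column j.
    states = list(range(num_states))
    delta = [[rng.choices(states, weights=phi[j])[0] for j in range(num_symbols)]
             for _ in range(num_states)]
    return mu, phi, pi, delta


rng = random.Random(0)
mu, phi, pi, delta = sample_pdfa_prior(5, 3, alpha=2.0, beta=3.0, gamma=5.0, rng=rng)
```

Entries of a column share the same $\phi_j$, so states favored by $\phi_j$ tend to be reused as targets whenever symbol $\sigma_j$ is emitted.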


We choose this particular inductive bias, with transitions tied together within a column of $\delta$,
because we wanted the most recent symbol emission to be informative about what the next state is. If we
instead had a single Dirichlet prior over all elements of $\delta$, transitions to a few states would
be highly likely no matter the context and those states would dominate the behavior of the automaton.
If we tied together rows of $\delta$ instead of columns, being in a particular state would tell us more
about the sequence of states we came from than the symbols that got us there.


Note that this prior stipulates a fully connected PDFA in which all states may transition to all others
and all symbols may be emitted from each state. This is slightly different than the canonical finite
state machine literature where sparse connectivity is usually the norm.


3.2 PDFA Inference 


Given observational data, we are interested in learning a posterior distribution over PDFAs. We do
this by Gibbs sampling the transition matrix $\delta$ with $\pi$ and $\phi_j$ integrated out. To start
inference we need the likelihood function for a fixed PDFA; it is given by


$$p(x_{0:T} \mid \pi, \delta) = \pi(\xi_0, x_0) \prod_{t=1}^{T} \pi(\xi_t, x_t).$$


Remember that $\xi_t \mid \xi_{t-1}, x_{t-1}$ is deterministic given the transition function $\delta$.
We can marginalize $\pi$ out of this expression and express the likelihood of the data in a form that
depends only on the counts of symbols emitted from each state. Define the count matrix $c$ for the
sequence $x_{0:T}$ and transition matrix $\delta$ as $c_{ij} = \sum_{t=0}^{T} I_{ij}(\xi_t, x_t)$, where
$I_{ij}(\xi_t, x_t)$ is an indicator function for the automaton being in state $q_i$ and emitting
$\sigma_j$ at time $t$, i.e. $\xi_t = q_i$ and $x_t = \sigma_j$. This matrix $c = [c_{ij}]$ gives the
number of times each symbol is emitted from each state. Due to multinomial-Dirichlet conjugacy
we can express the probability of a sequence given the transition function $\delta$, the count matrix
$c$ and $\beta$:


$$p(x_{0:T} \mid \delta, c, \beta) = \int p(x_{0:T} \mid \pi, \delta)\, p(\pi \mid \beta)\, d\pi = \prod_{i=0}^{|Q|-1} \frac{\Gamma(\beta)}{\Gamma(\beta/|\Sigma|)^{|\Sigma|}} \, \frac{\prod_{j=1}^{|\Sigma|} \Gamma(\beta/|\Sigma| + c_{ij})}{\Gamma\left(\beta + \sum_{j=1}^{|\Sigma|} c_{ij}\right)} \qquad (3)$$


If the transition matrix $\delta$ is observed we have a closed-form expression for its likelihood given
$\mu$ with all $\phi_j$'s marginalized out. Let $v_{ij}$ be the number of times state $q_i$ is
transitioned to given that $\sigma_j$ was the last symbol emitted, i.e. $v_{ij}$ is the number of times
$\delta_{i'j} = q_i$ for all states $i'$ in the column $j$. The marginal likelihood of $\delta$ in terms
of $\mu$ is then:


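$$p(\delta \mid \mu, \alpha) = \int p(\delta \mid \phi)\, p(\phi \mid \mu, \alpha)\, d\phi = \prod_{j=1}^{|\Sigma|} \frac{\Gamma(\alpha)}{\prod_{i=0}^{|Q|-1} \Gamma(\alpha\mu_i)} \, \frac{\prod_{i=0}^{|Q|-1} \Gamma(\alpha\mu_i + v_{ij})}{\Gamma(\alpha + |Q|)} \qquad (4)$$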


We perform posterior inference in the finite model by sampling elements of $\delta$ and the vector
$\mu$. One can sample $\delta_{ij}$ given the rest of the matrix $\delta_{-ij}$ using


$$P(\delta_{ij} \mid \delta_{-ij}, x_{0:T}, \mu, \alpha) \propto p(x_{0:T} \mid \delta_{ij}, \delta_{-ij})\, P(\delta_{ij} \mid \delta_{-ij}, \mu, \alpha) \qquad (5)$$


Both terms on the right hand side of this equation have closed-form expressions, the first given in
(3). The second can be found from (4) and is


$$P(\delta_{ij} = q_{i'} \mid \delta_{-ij}, \alpha, \mu) = \frac{\alpha\mu_{i'} + v_{i'j}}{\alpha + |Q| - 1} \qquad (6)$$


where $v_{i'j}$ is the number of elements in column $j$ equal to $q_{i'}$ excluding $\delta_{ij}$. As
$|Q|$ is finite, we compute (5) for all values of $\delta_{ij}$ and normalize to produce the required
conditional probability distribution.
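The prior conditional (6) is a standard Pólya-urn predictive and is easy to compute directly. The sketch below assumes the transition matrix is stored as a list of rows with integer state and symbol indices; the function name is hypothetical.

```python
def transition_prior_conditional(delta, i, j, mu, alpha):
    """Return P(delta[i][j] = q_k | other entries of column j) for every state k."""
    num_states = len(delta)
    counts = [0] * num_states              # v_{kj}, computed with row i excluded
    for row in range(num_states):
        if row != i:
            counts[delta[row][j]] += 1
    denom = alpha + num_states - 1         # alpha + |Q| - 1
    return [(alpha * mu[k] + counts[k]) / denom for k in range(num_states)]


# Illustrative values: three states, two symbols.
example_delta = [[0, 1], [0, 2], [2, 1]]
probs = transition_prior_conditional(example_delta, 0, 0, [0.5, 0.3, 0.2], 1.0)
```

Because the remaining $|Q| - 1$ counts in the column and the entries of $\mu$ each sum to their totals, the returned probabilities sum to one without further normalization.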


Note that in (3), the count matrix $c$ may be profoundly impacted by changing even a single element
of $\delta$. The values in $c$ depend on the specific sequence of states the automaton used to generate
$x$. Changing the value of a single element of $\delta$ affects the state trajectory the PDFA must
follow to generate $x_{0:T}$. Among other things this means that some elements of $c$ that were nonzero
may become zero, and vice versa.
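A small sketch shows both the count matrix and how a single changed transition reroutes the path. It also evaluates the Dirichlet-multinomial marginal that follows from a uniform Dirichlet prior with total mass $\beta$ on each state's emission distribution. Function names and the integer encoding of states and symbols are assumptions for this illustration.

```python
from math import lgamma


def count_matrix(delta, q0, xs, num_states, num_symbols):
    """c[i][j] = number of times symbol j is emitted from state i along the unique path."""
    c = [[0] * num_symbols for _ in range(num_states)]
    state = q0
    for x in xs:
        c[state][x] += 1
        state = delta[state][x]
    return c


def collapsed_log_likelihood(c, beta):
    """Log Dirichlet-multinomial marginal of the data, one factor per state (row of c)."""
    total = 0.0
    for row in c:
        k = len(row)
        total += lgamma(beta) - k * lgamma(beta / k)
        total += sum(lgamma(beta / k + n) for n in row) - lgamma(beta + sum(row))
    return total


xs = [0, 1, 1, 0]
before = count_matrix([[0, 1], [1, 0]], 0, xs, 2, 2)   # original transitions
after = count_matrix([[1, 1], [1, 0]], 0, xs, 2, 2)    # only delta[0][0] changed
```

Changing the single entry `delta[0][0]` moves counts between rows, which is why the likelihood term must be re-evaluated along the whole sequence after each proposed change.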


We can reduce the computational cost of inference by deleting transitions $\delta_{ij}$ for which the
corresponding counts $c_{ij}$ become 0. In practical sampler implementations this means that one need
not even represent transitions corresponding to zero counts. The likelihood of the data (3) does not
depend on the value of $\delta_{ij}$ if symbol $\sigma_j$ is never emitted while the machine is in state
$q_i$. In this case sampling from (5) is the same as sampling without conditioning on the data at all.
Thus, if while sampling we change some transition that renders $c_{ij} = 0$ for some values of $i$ and
$j$, we can delete $\delta_{ij}$ until another transition is changed such that $c_{ij}$ becomes nonzero
again, at which point we sample $\delta_{ij}$ anew. Under the marginal joint distribution of a column of
$\delta$ the row entries in that column are exchangeable, and so deleting an entry of $\delta$ has the
same effect as marginalizing it out. When all $\delta_{ij}$ for some state $q_i$ are marginalized out,
we can say the state itself is marginalized out. When we delete an element from a column of $\delta$,
we replace the $|Q| - 1$ in the denominator of (6) with $D_j^+ = \sum_i I(v_{ij} \neq 0)$, the number
of entries in the $j$th column of $\delta$ that are not marginalized out, yielding


$$P(\delta_{ij} = q_{i'} \mid \delta_{-ij}, \alpha, \mu) = \frac{\alpha\mu_{i'} + v_{i'j}}{\alpha + D_j^+} \qquad (7)$$


If, when sampling $\delta_{ij}$, it is assigned a state $q_{i'}$ such that some $c_{i'j'}$ which was zero
is now nonzero, we simply reinstantiate $\delta_{i'j'}$ by drawing from (7) and update $D_{j'}^+$. When
sampling a single $\delta_{ij}$ there can be many such transitions, as the path through the machine
dictated by $x_{0:T}$ may use many transitions in $\delta$ that were deleted. In this case we update
incrementally, increasing $D_{j'}^+$ and $v_{i'j'}$ as we go.


While it is possible to construct a Gibbs sampler using (5) in this collapsed representation, such a
sampler requires a Monte Carlo integration over a potentially large subset of the marginalized-out
transitions in $\delta$, which may be costly. A simpler strategy is to pretend that all entries of
$\delta$ exist but are sampled in a “just-in-time” manner. This gives rise to a Metropolis-Hastings (MH)
sampler for $\delta$ where the proposed value for $\delta_{ij}$ is either one of the instantiated states
or any one of the equivalent marginalized-out states. Any time a marginalized-out element of $\delta$
is required we can pretend that we had just sampled its value; because its value had no effect on the
likelihood of the data, it would have been sampled directly from (7). It is in this sense that all
marginalized-out states are equivalent: we know nothing more about their connectivity structure
than that given by the prior in (7).
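The just-in-time idea can be sketched as a lazily filled transition table. This is an illustration under simplifying assumptions: the number of states is finite, $\mu$ is held fixed, and the class and attribute names are hypothetical. A missing entry is drawn with weights proportional to $\alpha\mu_k + v_{kj}$, the numerator of (7), so the normalizer never has to be formed explicitly.

```python
import random


class LazyTransitions:
    """Transition table whose entries are sampled only when the path first needs them."""

    def __init__(self, mu, alpha, rng):
        self.mu, self.alpha, self.rng = mu, alpha, rng
        self.table = {}     # (state, symbol) -> next state, instantiated entries only
        self.counts = {}    # symbol -> list of v_{kj} over instantiated entries

    def get(self, state, symbol):
        if (state, symbol) not in self.table:
            v = self.counts.setdefault(symbol, [0] * len(self.mu))
            # Prior predictive given the entries already instantiated in this column.
            weights = [self.alpha * m + n for m, n in zip(self.mu, v)]
            nxt = self.rng.choices(range(len(self.mu)), weights=weights)[0]
            self.table[(state, symbol)] = nxt
            v[nxt] += 1
        return self.table[(state, symbol)]


lazy = LazyTransitions([0.5, 0.3, 0.2], alpha=1.0, rng=random.Random(0))
state = 0
for x in [0, 1, 1, 0, 1, 0, 0]:
    state = lazy.get(state, x)
```

Only transitions that the data actually exercise are ever stored, so the table holds at most one entry per observed symbol.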


                 PDIA    PDIA-MAP  HMM-EM  bigram  trigram  4-gram  5-gram  6-gram  SSM
AIW perplexity   5.13    5.46      7.89    9.71    6.45     5.13    4.80    4.69    4.78
AIW states       365.6   379       52      28      382      2,023   5,592   10,838  19,358
DNA perplexity   3.72    3.72      3.76    3.77    3.75     3.74    3.73    3.72    3.56
DNA states       64.7    54        19      5       21       85      341     1,365   314,166


Table 1: PDIA inference performance relative to HMM and fixed order Markov models. Top rows:
perplexity. Bottom rows: number of states in each model. For the PDIA this is an average number.


For the MH sampler, denote the set of non-marginalized-out $\delta$ entries
$\delta^+ = \{\delta_{ij} : c_{ij} > 0\}$. We propose a new value $q_{i^*}$ for one
$\delta_{ij} \in \delta^+$ according to (7). The conditional posterior probability of this proposal is
proportional to $p(x_{0:T} \mid \delta_{ij} = q_{i^*}, \delta^+_{-ij})\, P(\delta_{ij} = q_{i^*} \mid \delta^+_{-ij})$.
The Hastings correction exactly cancels out the proposal probability in the accept/reject ratio, leaving
an MH accept probability for $\delta_{ij}$ being set to $q_{i^*}$ given that its previous value was
$q_{i'}$ of


$$\alpha(\delta_{ij} = q_{i^*} \mid \delta_{ij} = q_{i'}) = \min\left(1, \frac{p(x_{0:T} \mid \delta_{ij} = q_{i^*}, \delta^+_{-ij})}{p(x_{0:T} \mid \delta_{ij} = q_{i'}, \delta^+_{-ij})}\right) \qquad (8)$$


Whether $q_{i^*}$ is marginalized out or not, evaluating
$p(x_{0:T} \mid \delta_{ij} = q_{i^*}, \delta^+_{-ij})$ may require reinstantiating marginalized-out
elements of $\delta$. As before, these values are sampled from (7) on a just-in-time schedule. If the
new value is accepted, all $\delta_{ij} \in \delta^+$ for which $c_{ij} = 0$ are removed, and we then
move to the next transition in $\delta$ to sample.
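Since the prior term cancels against the proposal, the accept step reduces to a likelihood ratio. A minimal sketch in log space follows; the function names are hypothetical, and the two log-likelihoods are assumed to have been computed elsewhere from the collapsed likelihood of the data under the proposed and current transition values.

```python
import math
import random


def mh_accept_probability(loglik_proposed, loglik_current):
    """min(1, likelihood ratio), evaluated in log space to avoid overflow."""
    return math.exp(min(0.0, loglik_proposed - loglik_current))


def mh_step(current, proposed, loglik_current, loglik_proposed, rng):
    """Return the retained transition value and its log-likelihood."""
    if rng.random() < mh_accept_probability(loglik_proposed, loglik_current):
        return proposed, loglik_proposed
    return current, loglik_current
```

A proposal that raises the likelihood is always accepted; one that lowers it is accepted with probability equal to the likelihood ratio.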


In the finite case, one can sample $\mu$ by Metropolis-Hastings or use a MAP estimate as in [7].
Hyperparameters $\alpha$, $\beta$ and $\gamma$ can be sampled via Metropolis-Hastings updates. In our
experiments we use Gamma(1,1) hyperpriors.


3.3 The Probabilistic Deterministic Infinite Automaton 


We would like to avoid placing a strict upper bound on the number of states so that model complexity
can grow with the amount of training data. To see how to do this, consider what happens when
$|Q| \to \infty$. In this case, the right hand side of equations (1) and (2) must be replaced by
infinite dimensional alternatives


$$\mu \sim \mathrm{PY}(\gamma, d_0, H)$$

$$\phi_j \sim \mathrm{PY}(\alpha, d, \mu)$$

$$\delta_{ij} \sim \phi_j$$


where PY stands for Pitman-Yor process and $H$ in our case is a geometric distribution over the
integers with parameter $\lambda$. The resulting hierarchical model becomes the hierarchical Pitman-Yor
process (HPYP) over a discrete alphabet [14]. The discount parameters $d_0$ and $d$ are particular to
the infinite case, and when both are zero the HPYP becomes the well known hierarchical Dirichlet
process (HDP), which is the infinite dimensional limit of (1) and (2) [15]. Given a finite amount of
data, there can only be nonzero counts for a finite number of state/symbol pairs, so our marginalization
procedure from the finite case will yield a $\delta$ with at most $T$ elements. Denote these
non-marginalized-out entries by $\delta^+$. We can sample the elements of $\delta^+$ as before using (8)
provided that we can propose from the HPYP. In many HPYP sampler representations this is easy to do. We
use the Chinese restaurant franchise representation [15] in which the posterior predictive distribution
of $\delta_{ij}$ given $\delta^+_{-ij}$ can be expressed with $\phi_j$ and $\mu$ integrated out as


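$$P(\delta_{ij} = q_{i'} \mid \delta^+_{-ij}, \alpha, \gamma) = \frac{v_{i'j} - k_{i'j} d}{\alpha + D_j^+} + \frac{\alpha + k_{\cdot j} d}{\alpha + D_j^+} \left( \frac{w_{i'} - \kappa_{i'} d_0}{\gamma + w_\cdot} + \frac{\gamma + \kappa_\cdot d_0}{\gamma + w_\cdot} H(q_{i'}) \right)$$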


where $w_{i'}$, $k_{i'j}$, $\kappa_{i'}$, $w_\cdot = \sum_i w_i$, $k_{\cdot j} = \sum_i k_{ij}$, and
$\kappa_\cdot = \sum_i \kappa_i$ are stochastic bookkeeping counts required by the Chinese restaurant
franchise sampler. These counts must themselves be sampled [15]. The discount hyperparameters can also
be sampled by Metropolis-Hastings.
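As a hedged illustration of the remark that zero discounts recover the HDP, suppose the predictive has the usual two-level Chinese restaurant franchise form, in which $d$ multiplies the column-level table counts $k$ and $d_0$ multiplies the top-level table counts $\kappa$. This form is an assumption of the sketch, not a restatement of the authors' derivation. Setting $d = d_0 = 0$ would then leave:

```latex
P(\delta_{ij} = q_{i'} \mid \delta^+_{-ij}, \alpha, \gamma)
  = \frac{v_{i'j}}{\alpha + D_j^+}
  + \frac{\alpha}{\alpha + D_j^+}
    \left( \frac{w_{i'}}{\gamma + w_\cdot}
         + \frac{\gamma}{\gamma + w_\cdot}\, H(q_{i'}) \right)
```

Under that assumption the expression has the same shape as (7), with the fixed template weight $\mu_{i'}$ replaced by its own predictive estimate, which backs off to the base distribution $H$.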


4 Experiments and Results 


To test our PDIA inference approach we evaluated it on discrete natural sequence prediction and
compared its performance to HMMs and smoothed n-gram models. We trained the models on two
datasets: a character sequence from Alice in Wonderland [2] and a short sequence of mouse DNA.
The Alice in Wonderland (AIW) dataset was preprocessed to remove all characters but letters and
spaces, shift all letters from upper to lower case, and split along sentence dividers to yield a
27-character alphabet (a-z and space). We trained on 100 random sentences (9,986 characters) and
tested on 50 random sentences (3,891 characters). The mouse DNA dataset consisted of a fragment
of chromosome 2 with 194,173 base pairs, which we treated as a single unbroken string. We used
the first 150,000 base pairs for training and the rest for testing. For AIW, the state of the PDIA
model was always set to $q_0$ at the start of each sentence. For DNA, the state of the PDIA model at
the start of the test data was set to the last state of the model after accepting the training data. We
placed Gamma(1,1) priors over $\alpha$, $\beta$ and $\gamma$, set $\lambda = 0.001$, and used uniform
priors for $d_0$ and $d$.


Figure 1: Subsampled PDIA sampler trace for Alice in Wonderland. The top trace is the joint log
likelihood of the model and training data, the bottom trace is the number of states.


We evaluated the performance of the learned models by calculating the average per character
predictive perplexity of the test data. For training data $x_{1:T}$ and test data $y_{1:T'}$ this is
given by $2^{-\frac{1}{T'} \log_2 P(y_{1:T'} \mid x_{1:T})}$. It is a measure of the average uncertainty
the model has about what character comes next given the sequence up to that point, and is at most
$|\Sigma|$. We evaluated the probability of the test data incrementally, integrating the test data into
the model in the standard Bayesian way.
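The perplexity formula can be computed from the sequence of one-step predictive probabilities, since the log of the joint test probability is the sum of the logs of the conditionals. A minimal sketch, with a hypothetical function name:

```python
import math


def perplexity(predictive_probs):
    """Per-symbol perplexity 2^(-(1/T') * sum_t log2 p_t) from one-step predictive probabilities."""
    t = len(predictive_probs)
    avg_log2 = sum(math.log2(p) for p in predictive_probs) / t
    return 2.0 ** (-avg_log2)


# A model that always predicts uniformly over 27 characters has perplexity 27.
uniform = perplexity([1.0 / 27] * 100)
```

A uniform predictor over the 27-character AIW alphabet therefore sits exactly at the $|\Sigma|$ reference level mentioned in the text.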


Test perplexity results are shown in Table 1 on the first line of each subtable. Each sample
corresponds to one pass through every instantiated transition. Every fifth sample for AIW and every
tenth sample for DNA after burn-in was used for prediction. For AIW, we ran 15,000 burn-in samples and
used 3,500 samples for predictive inference. Subsampled sampler diagnostic plots, shown in Figure 1,
demonstrate the convergence properties of our sampler. When modeling the DNA dataset we burned in
for 1,000 samples and used 900 samples for inference. For the smoothed n-gram models, we report
thousand-sample average perplexity results for hierarchical Pitman-Yor process (HPYP) [14] models
of varying Markov order (1 through 5, notated as bigram through 6-gram) after burning each model
in for one hundred samples. We also show the performance of the single particle incremental variant
of the sequence memoizer (SM) [5], the SM being the limit of an n-gram model as $n \to \infty$. We also
show results for a hidden Markov model (HMM) [8] trained using expectation-maximization (EM).
We determined the best number of hidden states by cross-validation on the test data (a procedure
used here to produce optimistic HMM performance for comparison purposes only).


The performance of the PDIA exceeds that of the HMM and is approximately equal to that of 
a smoothed 4-gram model, though it does not outperform very deep, smoothed Markov models. 
This is in contrast to [16], which found that PDFAs trained on natural language data were able to 
predict as well as unsmoothed trigrams, but were significantly worse than smoothed trigrams, even 
when averaging over multiple learned PDFAs. As can be seen in the second line of each subtable 
in Table 1, the MAP number of states learned by the PDIA is significantly lower than that of the 
n-gram model with equal predictive performance. 


Figure 2: Two PNFAs outside the class of PDFAs. (a) can be represented by a mixture of two
PDFAs, one following the right branch from state 0, the other following the left branch. (b), in
contrast, cannot be represented by any finite mixture of PDFAs.


Unlike the HMM, the computational complexity of PDFA prediction does not depend on the number
of states in the model because only a single path through the states is followed. This means that the
asymptotic cost of prediction for the PDIA is $O(LT')$, where $L$ is the number of posterior samples
and $T'$ is the length of the test sequence. For any single HMM it is $O(KT')$, where $K$ is the number
of states in the HMM. This is because all possible paths must be followed to achieve the given HMM
predictive performance (although a subset of possible paths could be followed if doing approximate
inference). In PDIA inference we too can choose the number of samples used for prediction, but
here even a single sample has empirical prediction performance superior to averaging over all paths
in an HMM. The computational complexity of smoothing n-gram inference is equivalent to that of PDIA
inference; however, the storage cost for the large n-gram models is significantly higher than that of
the estimated PDIA for the same predictive performance.
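The $O(LT')$ cost can be seen in a sketch of prediction with a finite mixture of PDFAs: each of the $L$ automata tracks a single current state, so every test symbol costs one lookup per automaton. The sketch makes two simplifying assumptions that are not claims about the authors' procedure: the samples are weighted equally, and the emission distributions are held fixed rather than updated as test data arrive. All names and numbers are illustrative.

```python
def mixture_predictive(pdfas, q0, test_symbols):
    """Average one-step predictive probabilities over L PDFAs, one tracked state per PDFA."""
    states = [q0] * len(pdfas)
    probs = []
    for x in test_symbols:                        # T' steps
        p = 0.0
        for l, (delta, pi) in enumerate(pdfas):   # L constant-time updates per step
            p += pi[states[l]][x]
            states[l] = delta[(states[l], x)]
        probs.append(p / len(pdfas))
    return probs


# Two hypothetical PDFAs over {"a", "b"}: a two-state machine and a single-state machine.
pdfa_a = ({(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 0},
          {0: {"a": 0.9, "b": 0.1}, 1: {"a": 0.2, "b": 0.8}})
pdfa_b = ({(0, "a"): 0, (0, "b"): 0}, {0: {"a": 0.5, "b": 0.5}})
probs = mixture_predictive([pdfa_a, pdfa_b], 0, ["a", "b", "b"])
```

The inner loop never touches the number of states in any automaton, which is the source of the independence from model size noted in the text.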


5 Theory and Related Work 


The PDIA posterior distribution takes the form of an infinite mixture of PDFAs. In practice, we
run a sampler for some number of iterations and approximate the posterior with a finite mixture
of PDFAs. For this reason, we now consider the expressive power of finite mixtures of PDFAs.
We show that they are strictly more expressive than PDFAs, but strictly less expressive than hidden
Markov models. Probabilistic nondeterministic finite automata (PNFA) are a strictly larger model
class than PDFAs. For example, the PNFA in Figure 2(a) cannot be expressed as a PDFA [3]. However,
it can be expressed as a mixture of two PDFAs, one with $Q = \{q_0, q_1, q_3\}$ and the other with
$Q = \{q_0, q_2, q_3\}$. Thus mixtures of PDFAs are a strictly larger model class than PDFAs. In
general, any PNFA where the nondeterministic transitions can only be visited once can be expressed as a
mixture of PDFAs. However, if we replace transitions to $q_3$ with transitions to $q_0$, as in
Figure 2(b), there is no longer any equivalent finite mixture of PDFAs, since the nondeterministic
branch from $q_0$ can be visited an arbitrary number of times.


Previous work on PDFA induction has focused on accurately discovering model structure when the 
true generative mechanism is a PDFA. State merging algorithms do this by starting with the trivial 
PDFA that only accepts the training data and merging states that pass a similarity test [1, 17], and 
have been proven to identify the correct model in the limit of infinite data. State splitting algorithms 
start at the opposite extreme, with the trivial single-state PDFA, and split states that pass a difference 
test [12, 13]. These algorithms return only a deterministic estimate, while ours naturally expresses 
uncertainty about the learned model. 


To test if we can learn the generative mechanism given our inductive bias, we trained the PDIA on 
data from three synthetic grammars: the even process [13], the Reber grammar [11] and the Feldman 
grammar [4], which have up to 7 states and 7 symbols in the alphabet. In each case the mean number 
of states discovered by the model approached the correct number as more data was used in training. 
Results are presented in Figure 3. Furthermore, the predictive performance of the PDIA was nearly
equivalent to that of the actual data generating mechanism.
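For readers unfamiliar with the synthetic grammars, the even process is the simplest of the three. The sketch below follows its usual two-state definition, in which 1s can only occur in blocks of even length; the emission probability of 0.5 and the function names are assumptions of this illustration and are not taken from the experiments above.

```python
import random


def generate_even_process(length, rng, p_zero=0.5):
    """Two-state PDFA: state A emits 0 (stay in A) or 1 (go to B); state B emits 1 (back to A)."""
    state, out = "A", []
    for _ in range(length):
        if state == "A":
            if rng.random() < p_zero:
                out.append(0)
            else:
                out.append(1)
                state = "B"
        else:
            out.append(1)
            state = "A"
    return out


def completed_one_runs(symbols):
    """Lengths of runs of 1s that are terminated by a 0."""
    runs, current = [], 0
    for s in symbols:
        if s == 1:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    return runs


xs = generate_even_process(1000, random.Random(1))
```

Every completed run of 1s in `xs` has even length, a constraint that no fixed-order Markov model can capture exactly but a two-state PDFA represents directly.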


6 Discussion 


Our Bayesian approach to PDIA inference can be interpreted as a stochastic search procedure for 
PDFA structure learning where the number of states is unknown. In Section 5 we presented evidence 
that PDFA samples from our PDIA inference algorithm have the same characteristics as the true 
generative process. This in and of itself may be of interest to the PDFA induction community. 
Figure 3: Three synthetic PDFAs: (a) even process [13], (b) Reber grammar [11], (c) Feldman
grammar [4]. (d) Posterior mean and standard deviation of the number of states discovered during PDIA
inference for varying amounts of data generated by each of the synthetic PDFAs. PDIA inference
discovers PDFAs with the correct number of states.


We ourselves are more interested in establishing new ways to produce smoothed predictive conditional
distributions. Inference in the PDIA presents a completely new approach to smoothing:
averaging over PDFA model structure rather than hierarchically smoothing related
emission distribution estimates. Our PDIA approach gives us an attractive ability to trade off
between model simplicity in terms of number of states, computational complexity in terms of asymptotic
cost of prediction, and predictive perplexity. While our PDIA approach may not yet outperform
the best smoothing Markov model approaches in terms of predictive perplexity alone, it does out- 
perform them in terms of model complexity required to achieve the same predictive perplexity, and 
outperforms HMMs in terms of asymptotic time complexity of prediction. This suggests that a 
future combination of smoothing over model structure and smoothing over emission distributions 
could produce excellent results. PDIA inference gives researchers another tool to choose from when 
building models. If very fast prediction is desirable and the predictive perplexity difference between 
the PDIA and, for instance, the most competitive n-gram is insignificant from an application per- 
spective, then doing finite sample inference in the PDIA offers a significant computational advantage 
in terms of memory. 


We indeed believe the most promising approach to improving PDIA predictive performance is to 
construct a smoothing hierarchy over the state specific emission distributions, as is done in the 
smoothing n-gram models. For an n-gram, where every state corresponds to a suffix of the sequence, 
the predictive distribution for a suffix is smoothed by the predictive distribution for a shorter suffix,
for which there are more observations. This makes it possible to increase the size of the model indef- 
initely without generalization performance suffering [18]. In the PDIA, by contrast, the predictive 
probabilities for states are not tied together. Since states of the PDIA are not uniquely identified 
by suffixes, it is no longer clear what the natural smoothing hierarchy is. It is somewhat surprising 
that PDIA learning works nearly as well as n-gram modeling even without a smoothing hierarchy 
for its emission distributions. Imposing a hierarchical smoothing of the PDIA emission distributions 
remains an open problem. 